Random Forest in Machine Learning

Forest Foresight

Mika Goins
Matt McGehee
Stutti Smit-Johnson
Advisor: Dr. Seals

Random Forest in Machine Learning: Foresight from the Forest

A Random Forest Guided Tour

by Gérard Biau and Erwan Scornet [1]

  • Origin & Success: Introduced by Breiman (2001) [2], Random Forests excel in classification/regression, combining decision trees for strong performance.
  • Versatility: Effective for large-scale tasks, adaptable, and highlights important features across various domains.
  • Ease of Use: Simple with minimal tuning, handles small samples and high-dimensional data.
  • Theoretical Gaps: Limited theoretical insights; known for complexity and black-box nature.
  • Key Mechanisms: Uses bagging and CART-split criteria for robust performance, though hard to analyze rigorously.
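The bagging and randomized CART-split mechanisms above can be sketched in a few lines with scikit-learn; the data here is synthetic and the hyperparameter values are illustrative, not choices made in this project:

```python
# Minimal sketch of bagging + randomized CART splits via scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple separable rule

forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    max_features="sqrt",  # random feature subset tried at each CART split
    bootstrap=True,       # bagging: resample the training set per tree
    random_state=0,
).fit(X, y)

print(forest.score(X, y))  # training accuracy of the ensemble
```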

Tree Prediction

Each tree estimates the response at point \(x\) as:

\[ m_n(x; \Theta_j, D_n) = \frac{\sum_{i \in D_n(\Theta_j)} \mathbf{1}_{X_i \in A_n(x; \Theta_j, D_n)} Y_i}{N_n(x; \Theta_j, D_n)} \]

  • \(D_n(\Theta_j)\) is the resampled data subset,
  • \(A_n(x; \Theta_j, D_n)\) is the cell containing \(x\), and
  • \(N_n(x; \Theta_j, D_n)\) is the count of points in the cell
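The cell average \(m_n\) can be illustrated numerically; the "tree" below is a single fixed-threshold split rather than a fitted CART tree, so it is a sketch of the estimator, not of the forest-growing procedure:

```python
# Sketch of the per-tree estimate m_n(x): average the Y_i whose X_i fall in
# the same cell A_n(x) as the query point x.
import numpy as np

def cell_average(x, X, Y, threshold=0.5):
    # A_n(x) is the half-line on the same side of the threshold as x
    in_cell = (X <= threshold) if x <= threshold else (X > threshold)
    N = in_cell.sum()                  # N_n(x): number of points in the cell
    return Y[in_cell].sum() / N if N else 0.0

X = np.array([0.1, 0.2, 0.4, 0.6, 0.9])
Y = np.array([1.0, 1.0, 0.0, 2.0, 4.0])
print(cell_average(0.3, X, Y))  # mean of Y over points with X <= 0.5
```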

Random Forest Classification

Splitting Criteria:

  • The Gini impurity measure is used to determine the best split:

\[ G = 1 - \sum_{k=1}^{K} p_k^2 \]

  • \(p_k\) represents the proportion of samples of class \(k\) in the node.
  • \(K\) is the number of classes.
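The Gini formula above translates directly to code:

```python
# Gini impurity G = 1 - sum_k p_k^2, computed from the class labels in a node.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # p_k: proportion of each class in the node
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # maximally mixed two-class node -> 0.5
print(gini([1, 1, 1, 1]))  # pure node -> 0.0
```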

Prediction:

  • Each tree makes a prediction using the majority class in the cell containing \(x\).
  • Classification uses a majority vote:

\[ m_{M, n}(x; \Theta_1, \ldots, \Theta_M, D_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]

  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.
  • \(M\): Total number of trees in the forest.
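The majority-vote rule above is a one-liner; `tree_votes` stands in for the per-tree predictions \(m_n(x; \Theta_j, D_n)\):

```python
# Majority vote over M tree predictions: predict 1 iff the mean vote > 1/2.
import numpy as np

def forest_classify(tree_votes):
    # tree_votes: array of 0/1 predictions, one per tree
    return 1 if np.mean(tree_votes) > 0.5 else 0

print(forest_classify([1, 1, 0]))     # 2/3 > 1/2  -> 1
print(forest_classify([0, 1, 0, 0]))  # 1/4 <= 1/2 -> 0
```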

Figure: random forest ensemble illustration [3]

The Data

  1. Where the data came from
  2. Size of data
  3. Key variables - Demographic, Behavioral, Seasonal
  4. Preprocessing needed
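A typical preprocessing pass for this kind of sales data might look like the sketch below; the column names and values are assumptions for illustration, not the actual SSI schema:

```python
# Hypothetical preprocessing sketch: parse dates, derive a seasonal feature,
# and one-hot encode categoricals. Column names are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Smoothie Island", "Philly Bite"],
    "Product": ["DC-01", "TSC-PQB-01"],
    "Qty Ordered": [100, 250],
    "Date Fulfilled": ["2023-01-15", "2023-06-02"],
})

df["Date Fulfilled"] = pd.to_datetime(df["Date Fulfilled"])
df["Month"] = df["Date Fulfilled"].dt.month               # seasonal feature
df = pd.get_dummies(df, columns=["Customer", "Product"])  # encode categoricals
print(df.shape)
```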

SSI Sales Data

Forest Foresight - SSI Sales Data

                                            Total Orders    Closed Short    Fulfilled
                                            (n=7585)        (n=733)         (n=6852)
Top Customers
  Smoothie Island                           1701 (22.43%)   455 (62.07%)    1246 (18.18%)
  Philly Bite                               1556 (20.51%)   267 (36.43%)    1289 (18.81%)
  PlatePioneers                             1396 (18.40%)   143 (19.51%)    1253 (18.29%)
  Berl Company                               906 (11.94%)     5 (0.68%)      901 (13.15%)
  DineLink Intl                              589 (7.77%)     42 (5.73%)      547 (7.98%)
Top Products
  DC-01 (Drink Carrier)                     1135 (14.96%)   345 (47.07%)     790 (11.53%)
  TSC-PQB-01 (Paper Quesadilla Clamshell)   1087 (14.33%)   389 (53.07%)     698 (10.19%)
  TSC-PW14X16-01 (1-Ply Paper Wrapper)       848 (11.18%)   283 (38.61%)     565 (8.25%)
  CMI-PCK-01 (Wrapped Plastic Cutlery Kit)   802 (10.57%)   288 (39.29%)     514 (7.50%)
  PC-05-B1 (Black 5oz Container)             745 (9.82%)    220 (30.01%)     525 (7.66%)

Sales over Time

Analysis - Stutti

Predicting Customer Churn

  • The churn indicator was created based on the Last Sales Date (0/1).
  • Predictors: Class, Product, Qty Ordered, and Date Fulfilled.
  • The model was evaluated using statistics from the Confusion Matrix.
  • 80% Accuracy achieved:
    • Sensitivity: The model correctly identifies 78.6% of the actual 0 cases.
    • Specificity: The model correctly identifies 88.12% of the actual 1 cases.
  • Negative Predictive Value (predictions of class 1): when the model predicts 1, it is correct only 47.62% of the time. This low NPV suggests the model misses some of the 1 cases.
  • McNemar's Test p-value (<2e-16): indicates a statistically significant asymmetry in how the two classes are misclassified.
  • Conclusion: Overall, the model balances the two classes well (balanced accuracy 0.8336), though its predictions for class 0 are more reliable than those for class 1.
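The confusion-matrix statistics above (sensitivity, specificity, NPV with class 0 treated as the positive class) can be reproduced from counts as in the generic sketch below; the labels here are made up, not the actual churn results:

```python
# Generic sketch: sensitivity, specificity, and NPV from a confusion matrix,
# treating class 0 as the positive class. Labels are illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tn / (tn + fp)  # fraction of actual 0s identified
specificity = tp / (tp + fn)  # fraction of actual 1s identified
npv = tp / (tp + fp)          # how often a predicted 1 is actually 1
print(sensitivity, specificity, npv)
```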

Analysis - Matt

Goal:

Predict whether an OPCO (distributor) falls within the top 25% of revenue, using quantity ordered, product, and substrate as predictors, in order to identify key distribution channels where the company could focus its marketing and advertising dollars.

Confusion Matrix

  • 0: Non-high-revenue OPCO
  • 1: High-revenue OPCO
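Constructing the 0/1 target amounts to flagging OPCOs above the 75th revenue percentile, as in this sketch; the revenue figures are illustrative, not actual SSI data:

```python
# Sketch of the target construction: 1 = OPCO in the top 25% of revenue.
import pandas as pd

revenue = pd.Series([120, 340, 90, 800, 150, 610, 75, 95],
                    index=[f"OPCO_{i}" for i in range(8)])
cutoff = revenue.quantile(0.75)            # 75th percentile of revenue
high_rev = (revenue > cutoff).astype(int)  # 1 = high-revenue OPCO
print(high_rev.sum())  # number of OPCOs above the cutoff
```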

Model Statistics

Metric              Value
Accuracy            0.956
95% CI              (0.951, 0.96)
Kappa               0.73
Sensitivity         0.66256
Specificity         0.98911
Pos Pred Value      0.87297
Neg Pred Value      0.96291
Prevalence          0.10146
Detection Rate      0.06722
Balanced Accuracy   0.82583

ROC Curve Analysis

Figure 1: ROC Curve for High Revenue Prediction
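An ROC analysis like the one in Figure 1 reduces to scoring predicted probabilities against the true labels; the scores below are made-up stand-ins, not the model's actual outputs:

```python
# Sketch of ROC/AUC computation; labels and scores are illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # predicted P(high revenue)
auc = roc_auc_score(y_true, scores)
print(auc)
```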

Feature Importance

Figure 2: Feature Importance for High Revenue Prediction
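Importance scores like those in Figure 2 come directly from a fitted forest; the data below is synthetic and the feature names are stand-ins for the project's predictors:

```python
# Sketch of extracting feature importances from a fitted random forest.
# Data is synthetic: only feature 0 actually drives the label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, imp in zip(["qty", "product", "substrate"],
                     forest.feature_importances_):
    print(name, round(imp, 3))
```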

Analysis - Mika

Conclusion

  • High Performance & Versatile: Robust, accurate, handles noisy/high-dimensional data. [1]
  • Ensemble Strength: Averaging over many trees; identifies key variables. [1]
  • Challenges: Limited theoretical understanding; complex interpretation. [1]
  • Future Focus: Enhance theory, increase interpretability, broaden applications. [1]

References

[1]
G. Biau and E. Scornet, “A random forest guided tour,” Test (Madr.), vol. 25, no. 2, pp. 197–227, Jun. 2016.
[2]
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[3]
Y. Fu, “Combination of random forests and neural networks in social lending,” Journal of Financial Risk Management, vol. 6, no. 4, pp. 418–426, 2017, doi: 10.4236/jfrm.2017.64030.